Developing an Assessment Checklist
The goal of this project was to prepare an assessment checklist for the Information Retrieval (IR) course at the Department of Computer Science in the academic year 2019/2020. The project was motivated by observations from the previous edition of the IR course: in the 2018/2019 edition there was a clear mismatch between the students' and the teachers' expectations regarding the assignment. Students struggled to understand how to structure and write a good-quality assignment. Furthermore, even though they were given guidelines on how to give feedback, they also struggled to provide useful feedback to their peers. With the proposed assessment checklist, we aimed to guide and help students in structuring their assignments and peer reviews. This paper is organised as follows: Section 1 describes the course during which the project was carried out; Section 2 presents the project goals and motivations; Section 3 describes in detail how the project was conducted; Section 4 reports an analysis of the project results; and Section 5 presents conclusions and future challenges.
Exploiting user signals and stochastic models to improve information retrieval systems and evaluation
The leitmotiv throughout this thesis is IR evaluation. We discuss different issues related to effectiveness measures and propose novel solutions to address these challenges. We start by providing a formal definition of utility-oriented measurement of retrieval effectiveness, based on the representational theory of measurement. The proposed theoretical framework contributes to a better understanding of the problem's complexities, separating those due to the inherent difficulty of comparing systems from those due to the expected numerical properties of measures. We then propose AWARE, a probabilistic framework for dealing with the noise and inconsistencies introduced when relevance labels are gathered from multiple crowd assessors. By modelling relevance judgements and crowd assessors as sources of uncertainty, we directly combine the performance measures computed on the ground truth generated by each crowd assessor, instead of adopting a classification technique to merge the labels at pool level. Finally, we investigate evaluation measures able to account for user signals. We propose a new user model based on Markov chains that allows the user to scan the result list with many degrees of freedom. We exploit this Markovian model to inject user models into precision, defining a new family of evaluation measures, and we embed this model as the objective function of a Learning to Rank (LtR) algorithm to improve system performance.
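The AWARE idea above can be illustrated with a minimal sketch. The function names and the uniform weighting below are assumptions for illustration, not the thesis' exact formulation: the effectiveness measure (here average precision) is computed against each crowd assessor's judgements separately, and the per-assessor scores are then combined, rather than first merging the labels into a single ground truth.

```python
def average_precision(ranking, relevant):
    """AP of a ranked list against one assessor's set of relevant documents."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def aware_score(ranking, assessor_judgments, weights=None):
    """Combine per-assessor AP scores (uniform weights by default)."""
    n = len(assessor_judgments)
    if weights is None:
        weights = [1.0 / n] * n
    per_assessor = [average_precision(ranking, rel) for rel in assessor_judgments]
    return sum(w * s for w, s in zip(weights, per_assessor))
```

For example, with two assessors who disagree on which document is relevant, the combined score reflects both views instead of forcing a single merged label per document.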
Learning Recommendations from User Actions in the Item-poor Insurance Domain
While personalised recommendations are successful in domains like retail,
where large volumes of user feedback on items are available, the generation of
automatic recommendations in data-sparse domains, like insurance purchasing, is
an open problem. The insurance domain is notoriously data-sparse because the
number of products is typically low (compared to retail) and they are usually
purchased to last for a long time. Also, many users still prefer the telephone
over the web for purchasing products, reducing the amount of web-logged user
interactions. To address this, we present a recurrent neural network
recommendation model that uses past user sessions as signals for learning
recommendations. Learning from past user sessions allows dealing with the data
scarcity of the insurance domain. Specifically, our model learns from several
types of user actions that are not always associated with items, and unlike all
prior session-based recommendation models, it models relationships between
input sessions and a target action (purchasing insurance) that does not take
place within the input sessions. Evaluation on a real-world dataset from the
insurance domain (ca. 44K users, 16 items, 54K purchases, and 117K sessions)
against several state-of-the-art baselines shows that our model outperforms the
baselines notably. Ablation analysis shows that this is mainly due to the
learning of dependencies across sessions in our model. We contribute the first
ever session-based model for insurance recommendation, and make available our
dataset to the research community.
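The key modelling point above, that the target action (a purchase) does not take place within the input sessions, can be sketched as a data-shaping step. The field names and layout below are assumptions for illustration, not the paper's actual schema: for each purchase, the training input is the sequence of actions from all sessions that happened strictly before it.

```python
def build_examples(user_sessions, purchases):
    """Build (input_actions, purchased_item) training pairs.

    user_sessions: time-sorted list of (timestamp, [action, ...]) for one user,
    where actions need not be associated with items (e.g. page views, calls).
    purchases: list of (timestamp, item). The label lies OUTSIDE the inputs:
    only sessions strictly before the purchase form the input sequence.
    """
    examples = []
    for p_time, item in purchases:
        past = [a for t, actions in user_sessions if t < p_time for a in actions]
        if past:
            examples.append((past, item))
    return examples
```

A sequence model (such as the paper's recurrent network) would then be trained on these pairs, consuming the action sequence and predicting the later purchase.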
Improving Information Retrieval Evaluation via Markovian User Models and Visual Analytics
To address the challenge of adapting experimental evaluation to constantly evolving user tasks and needs, we develop a new family of Markovian Information Retrieval (IR) evaluation measures, called Markov Precision (MP), in which the interaction between the user and the ranked result list is modelled via Markov chains, and which will be able to explicitly link lab-style and online evaluation methods. Moreover, since experimental results are often difficult to interpret, we will develop a Web-based Visual Analytics (VA) prototype in which an animated state diagram of the Markov chain explains how the user interacts with the ranked result list, in order to support careful failure analysis.
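As a rough illustration of the Markov Precision idea (the transition matrix and the weighting scheme below are illustrative assumptions, not the project's exact definition), one can weight precision at each rank by the stationary probability that a Markovian user visits that rank, assuming an ergodic chain over rank positions:

```python
def stationary(P, iters=200):
    """Stationary distribution of a row-stochastic matrix via power iteration
    (assumes the chain is ergodic so the iteration converges)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def markov_precision(rels, P):
    """rels: binary relevance per rank position; P: user transition matrix
    over rank positions. Weight precision@k by the stationary probability
    of the user being at rank k."""
    pi = stationary(P)
    prec_at, hits = [], 0
    for k, r in enumerate(rels, start=1):
        hits += r
        prec_at.append(hits / k)
    return sum(w * p for w, p in zip(pi, prec_at))
```

Different transition matrices then encode different browsing behaviours, such as mostly moving downward versus jumping freely across the list.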
Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study
Fairness is an emerging and challenging topic in recommender systems. In
recent years, various ways of evaluating and therefore improving fairness have
emerged. In this study, we examine existing evaluation measures of fairness in
recommender systems. Specifically, we focus solely on exposure-based fairness
measures of individual items that aim to quantify the disparity in how
individual items are recommended to users, separate from item relevance to
users. We gather all such measures and we critically analyse their theoretical
properties. We identify a series of limitations in each of them, which
collectively may render the affected measures hard or impossible to interpret,
to compute, or to use for comparing recommendations. We resolve these
limitations by redefining or correcting the affected measures, or we argue why
certain limitations cannot be resolved. We further perform a comprehensive
empirical analysis of both the original and our corrected versions of these
fairness measures, using real-world and synthetic datasets. Our analysis
provides novel insights into the relationship between measures based on
different fairness concepts, and different levels of measure sensitivity and
strictness. We conclude with practical suggestions of which fairness measures
should be used and when. Our code is publicly available. To our knowledge, this
is the first critical comparison of individual item fairness measures in
recommender systems.
Comment: Accepted to ACM Transactions on Recommender Systems (TORS).
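As a concrete (hypothetical) example of an exposure-based individual item fairness measure, the sketch below computes position-discounted exposure per item across users' recommendation lists and summarises its disparity with a Gini coefficient. The logarithmic discount and the Gini choice are illustrative assumptions, not necessarily among the specific measures analysed in the paper.

```python
import math

def item_exposure(rankings, n_items):
    """Sum position-discounted exposure (1 / log2(rank + 1)) per item over
    all users' recommendation lists; items are integer ids 0..n_items-1."""
    exposure = [0.0] * n_items
    for ranking in rankings:
        for pos, item in enumerate(ranking, start=1):
            exposure[item] += 1.0 / math.log2(pos + 1)
    return exposure

def gini(values):
    """Gini coefficient of exposure: 0 = perfectly equal across items,
    values approaching 1 = exposure concentrated on few items."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    if total == 0:
        return 0.0
    cum = sum((i + 1) * v for i, v in enumerate(vals))
    return (2 * cum) / (n * total) - (n + 1) / n
```

Note that the disparity is computed separately from item relevance, matching the paper's focus on exposure independent of how relevant items are to users.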
University of Copenhagen Participation in TREC Health Misinformation Track 2020
In this paper, we describe our participation in the TREC Health
Misinformation Track 2020. We submitted runs to the Total Recall Task and
13 runs to the Ad Hoc task. Our approach consists of 3 steps: (1) we create an
initial run with BM25 and RM3; (2) we estimate credibility and misinformation
scores for the documents in the initial run; (3) we merge the relevance,
credibility and misinformation scores to re-rank documents in the initial run.
To estimate credibility scores, we implement a classifier which exploits
features based on the content and the popularity of a document. To compute the
misinformation score, we apply a stance detection approach with a pretrained
Transformer language model. Finally, we use different approaches to merge the
scores: weighted average, distance among score vectors, and rank fusion.
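The score-merging step can be sketched as follows. The min-max normalisation and the specific weights below are assumptions for illustration (the paper also explores distance-based and rank-fusion merging); the misinformation score counts against a document:

```python
def minmax(scores):
    """Min-max normalise a dict of per-document scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_merge(relevance, credibility, misinfo, w=(0.5, 0.3, 0.2)):
    """Weighted combination of normalised scores; all three dicts are assumed
    to share the same document keys. Misinformation is subtracted."""
    rel, cred, mis = minmax(relevance), minmax(credibility), minmax(misinfo)
    return {d: w[0] * rel[d] + w[1] * cred[d] - w[2] * mis[d] for d in rel}
```

Re-ranking the initial BM25+RM3 run by the merged score then demotes documents that are credible-looking but flagged as likely misinformation.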
Basis of a Formal Framework for Information Retrieval Evaluation Measurements
In this paper we present a formal framework based on the representational theory of measurement, and we define and study the properties of utility-oriented measurements of retrieval effectiveness, such as AP, RBP, ERR, and many other popular IR evaluation measures.
The Importance of the Sugar Cane and Alcohol Sector and Its Relations with the Productive Structure of the Economy
In an economic context in which the state is reducing its role in the economy, the agents involved in the Sugar Cane and Alcohol sector, usually highly dependent on government policies, have to change their behavior so they can operate in a competitive market without benefits from the state. Therefore, an analysis of the economic relationships between this sector and the economic structure of Brazil would help to define how this sector could adapt to the new economic conditions. As such, the goals of this paper are to identify: i) the importance, in terms of backward and forward linkages, of the Sugar Cane and Alcohol sector in the economy, using the concepts of the Hirschman/Rasmussen Indexes (HR) and Pure Linkage Indexes; ii) how changes in the use coefficients of Sugar Cane and Alcohol products, by the sectors that use them as inputs, would spread throughout the economy, using the Field of Influence approach; and iii) the major relationships in the economy. The data used in this paper refer to the Brazilian input-output tables constructed for the years 1985, 1992 and 1995 at the level of 34 sectors. The major findings for the HR indexes show that the importance of the Sugar Cane and Alcohol sector, in terms of productive links, practically did not change in the 1985/1995 period. The results for the Normalized Pure Linkages, which show the relative importance of the sector in terms of production generation, show that the sector improved its position from 1985 to 1992 and declined in the following period, i.e., 1992 to 1995.
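For illustration, backward linkages of the Hirschman/Rasmussen kind can be computed from the Leontief inverse of a technical-coefficient matrix. The sketch below is a toy two-sector example, not the paper's 34-sector data: it builds the inverse via the Neumann series and normalises each column sum by the average element of the inverse, so a value above 1 marks an above-average backward linkage.

```python
def leontief_inverse(A, terms=200):
    """(I - A)^-1 via the Neumann series I + A + A^2 + ... , which converges
    when the technical-coefficient matrix A is productive (spectral radius < 1)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    P = [row[:] for row in L]  # running power of A, starting at A^0
    for _ in range(terms):
        P = [[sum(P[i][k] * A[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        L = [[L[i][j] + P[i][j] for j in range(n)] for i in range(n)]
    return L

def hr_backward(L):
    """Hirschman/Rasmussen backward linkage per sector: column sums of the
    Leontief inverse, normalised by the overall average element."""
    n = len(L)
    col = [sum(L[i][j] for i in range(n)) for j in range(n)]
    avg = sum(col) / (n * n)
    return [c / (n * avg) for c in col]
```

Forward linkages are computed analogously from the row sums, and sectors with both indexes above 1 are the classic "key sectors".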
Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study
Medical coding is the task of assigning medical codes to clinical free-text
documentation. Healthcare professionals manually assign such codes to track
patient diagnoses and treatments. Automated medical coding can considerably
alleviate this administrative burden. In this paper, we reproduce, compare, and
analyze state-of-the-art automated medical coding machine learning models. We
show that several models underperform due to weak configurations, poorly
sampled train-test splits, and insufficient evaluation. In previous work, the
macro F1 score has been calculated sub-optimally, and our correction doubles
it. We contribute a revised model comparison using stratified sampling and
identical experimental setups, including hyperparameters and decision boundary
tuning. We analyze prediction errors to validate and falsify assumptions of
previous works. The analysis confirms that all models struggle with rare codes,
while long documents only have a negligible impact. Finally, we present the
first comprehensive results on the newly released MIMIC-IV dataset using the
reproduced models. We release our code, model parameters, and new MIMIC-III and
MIMIC-IV training and evaluation pipelines to accommodate fair future
comparisons.
Comment: 11 pages, 6 figures, to be published in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), July 23--27, 2023, Taipei, Taiwan.
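The macro F1 issue mentioned above can be seen in a small sketch (single-label for simplicity, whereas medical coding is multi-label): averaging per-label F1 over labels that never occur in the evaluated split adds zero-valued terms and deflates the score, so the choice of the label set directly changes the reported number.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro F1: unweighted mean of per-label F1 over `labels`. Labels that
    never appear in y_true or y_pred contribute F1 = 0, pulling down the mean."""
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

With perfect predictions over two labels, scoring against exactly those two labels gives 1.0, while including a third, never-occurring code drops the same predictions to 2/3, which illustrates how restricting the average to codes present in the data can substantially raise the reported macro F1.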
Graph-based Recommendation for Sparse and Heterogeneous User Interactions
Recommender system research has oftentimes focused on approaches that operate
on large-scale datasets containing millions of user interactions. However, many
small businesses struggle to apply state-of-the-art models due to their very
limited availability of data. We propose a graph-based recommender model which
utilizes heterogeneous interactions between users and content of different
types and is able to operate well on small-scale datasets. A genetic algorithm
is used to find optimal weights that represent the strength of the relationship
between users and content. Experiments on two real-world datasets (which we
make available to the research community) show promising results (up to 7%
improvement), in comparison with other state-of-the-art methods for low-data
environments. These improvements are statistically significant and consistent
across different data samples.
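A toy sketch of the weight search follows; the operators and parameters are assumptions, not the paper's exact genetic algorithm. It evolves a vector of edge-type weights to maximise a fitness function, which in the paper's setting would be a recommendation quality metric computed on the weighted graph:

```python
import random

def genetic_search(fitness, n_weights, pop=20, gens=50, seed=0):
    """Toy elitist GA: keep the better half each generation, breed the rest
    by one-point crossover plus a single Gaussian point mutation in [0, 1]."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_weights)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]
        children = []
        while len(parents) + len(children) < pop:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(n_weights)
            child = a[:cut] + b[cut:]          # one-point crossover
            i = rng.randrange(n_weights)       # single point mutation
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Toy fitness (purely illustrative): prefer weights near (0.2, 0.8).
best = genetic_search(lambda w: -((w[0] - 0.2) ** 2 + (w[1] - 0.8) ** 2),
                      n_weights=2)
```

In the low-data regime the paper targets, such a derivative-free search is attractive because the fitness (ranking quality on a tiny validation set) is noisy and non-differentiable.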